Avoiding Scraper Traps
The requests library makes it easy to handle forms on websites, and it is also good at setting headers. HTTP headers are sets of attributes and preferences sent by you every time you make a request to a web server.
The headers sent by a typical Python scraper using the default urllib library might look like this:
Accept-Encoding: identity
User-Agent: Python-urllib/3.4
A good website for testing which browser properties are viewable by servers is https://www.whatismybrowser.com .
Usually the one setting that really matters when websites check for “humanness” is User-Agent.
Changing headers also brings a lot of convenience.
Let’s say you need some material in Chinese: simply change Accept-Language: en-US to Accept-Language: zh.
Mobile devices are often served a different version of a web page, so setting a mobile User-Agent such as:
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257
can bring a great change in what the server returns.
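As a minimal sketch of putting this together, the snippet below sends custom headers with requests. The target https://httpbin.org/headers is my choice, not from the book; it is simply a convenient echo service that returns the headers it received, so you can verify what a server actually sees:

import requests

# Custom headers so the request looks like a mobile browser
# asking for Chinese-language content.
headers = {
    "User-Agent": ("Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) "
                   "AppleWebKit/537.51.2 (KHTML, like Gecko) "
                   "Version/7.0 Mobile/11D257"),
    "Accept-Language": "zh",
}

# httpbin echoes back the headers it received.
response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.text)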
Handling Cookies
Cookies can keep you logged in on a site.
There are a number of browser plug-ins that can show you how cookies are being set as you visit and move around a site; EditThisCookie ( https://www.editthiscookie.com/ ), a Chrome extension, is a very good one.
The requests library is unable to handle many of the cookies produced by modern sites, because they are set by client-side JavaScript, which requests does not execute. To deal with those, use the Selenium and PhantomJS packages.
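A minimal sketch of inspecting such cookies, assuming Selenium is installed and the PhantomJS binary is on your PATH (the URL below is just a placeholder):

from selenium import webdriver

# PhantomJS is a headless browser, so client-side JavaScript runs
# and JavaScript-set cookies actually get created.
driver = webdriver.PhantomJS()
driver.get("https://example.com")
driver.implicitly_wait(1)  # give client-side scripts a moment to finish

# Print every cookie the site has set in this session.
print(driver.get_cookies())

driver.close()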
Timing Is Everything
Even though multithreaded jobs can sometimes make your scraper faster than a single thread, keep individual page loads and data requests to a minimum, and try to space them out by a few seconds:
time.sleep(3)
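For example, a simple polite-crawling loop might look like this (the URLs are placeholders):

import time
import requests

# Hypothetical list of pages to fetch; the point is the pause between requests.
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(3)  # wait a few seconds so requests are not fired in rapid bursts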
Reference:
Book: Web Scraping with Python: Collecting Data from the Modern Web, by Ryan Mitchell (O’Reilly).